Day 13: Getting Started with Transformers Library
Hugging Face Transformers is a library that gives you access to thousands of pre-trained models with just a few lines of code. Today we will walk through everything from installation to the library's core feature, pipeline().
Installation and Basic Setup
# Installation
# pip install transformers torch
from transformers import pipeline
# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was really fun!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Process multiple sentences at once
texts = ["The weather is nice today", "The service was very rude"]
results = classifier(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.4f})")
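When you omit the model argument, pipeline() falls back on a library default and prints a warning. Pinning an explicit checkpoint makes results reproducible across library versions; the sketch below assumes distilbert-base-uncased-finetuned-sst-2-english, which at the time of writing is the default checkpoint for sentiment analysis (defaults can change between releases).

```python
from transformers import pipeline

# Pin a specific checkpoint instead of relying on the library default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This movie was really fun!"))
```

The first call downloads and caches the checkpoint; subsequent runs load it from the local cache.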
Performing Various Tasks with pipeline()
A single pipeline() function can handle various NLP tasks including translation, summarization, and text generation. Internally, it automatically downloads the model and tokenizer.
from transformers import pipeline
# Text summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
article = """
AI technology is rapidly advancing and affecting various industries.
In particular, large language models are demonstrating human-level
performance in text generation, translation, code writing, and other areas.
Companies are leveraging these technologies to improve work efficiency
and develop new services.
"""
summary = summarizer(article, max_length=50, min_length=10)
print(summary[0]["summary_text"])
# Translation (English -> French)
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("Hello, how are you today?"))
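Question answering is another task the same function covers: the model extracts an answer span from a context passage. A minimal sketch, assuming the distilbert-base-cased-distilled-squad checkpoint (a commonly used extractive QA model):

```python
from transformers import pipeline

# Extractive QA: the model selects an answer span from the given context
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What does the pipeline download automatically?",
    context="The pipeline function automatically downloads the model and tokenizer.",
)
print(result["answer"], result["score"])
```

The result dictionary also includes start and end character offsets of the answer within the context.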
Fine-Grained Control with AutoModel and AutoTokenizer
While pipeline() is convenient, loading the model and tokenizer directly gives you more fine-grained control.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Model is running on {device}.")
# Tokenize and pass directly to the model
inputs = tokenizer("Transformers library is amazing!", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=-1).item()
print(f"Predicted class: {predicted_class}")
# Note: bert-base-uncased ships without a fine-tuned classification head,
# so the head here is randomly initialized and the predicted class is not
# meaningful until the model is fine-tuned on a labeled dataset.
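Logits are raw, unnormalized scores. If you want probabilities rather than just the argmax, apply a softmax over the class dimension; a minimal sketch using only torch, with made-up example logits for a 2-class head:

```python
import torch
import torch.nn.functional as F

# Example logits for a 2-class head (in practice: outputs.logits)
logits = torch.tensor([[1.2, -0.8]])

# Softmax converts raw scores into probabilities that sum to 1
probs = F.softmax(logits, dim=-1)
print(probs)               # ≈ tensor([[0.8808, 0.1192]])
print(probs.sum().item())  # 1.0
```

The argmax of the probabilities is always the same as the argmax of the logits; the softmax only matters when you need calibrated-looking scores or want to inspect the margin between classes.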
Use pipeline() for rapid prototyping, and the AutoModel/AutoTokenizer combination when you need custom logic. Choose the appropriate approach based on the situation.
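There is also a middle ground between the two approaches: load the model and tokenizer yourself, then wrap them in a pipeline(), keeping control over loading while retaining the convenient call interface. A sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the components explicitly, then hand them to pipeline()
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("Combining both approaches works well."))
```

This pattern is handy when you want to modify the model or tokenizer (for example, moving the model to a specific device) before wrapping it.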
Today’s Exercises
- Use pipeline("zero-shot-classification") to classify any news article text into the categories "Politics", "Economy", "Sports", and "Technology".
- Use pipeline("text-generation") to generate 3 continuations for the prompt "The future of artificial intelligence is". Utilize the num_return_sequences parameter.
- Tokenize 3 English sentences with AutoTokenizer and print each sentence's token count and token list. Use the model bert-base-multilingual-cased.